1.1 Business task
HA shop wants to find new opportunities to grow its business. For that reason, we are going to analyze information about its current customers utilizing the products offered by the company. The mission is to find trends in the data and make useful recommendations for the company.
1.2 Ask Question
What are the trends identified?
How could these trends help influence HA sale and marketing strategy?
### STEP 2: Prepare
2.1. ROCCC
* ROCCC is a method that we can use to check the quality of the data we have.
* Reliable — High — Reliable as it is collected directly from HA’s customers.
* Original — High — First party provider
* Comprehensive — Hight — Parameters match most of Bellabeat products’ parameters
* Current — High — Data is just updated
* Cited — High — Data collected from first party
2.2. Loading Packages
install.packages('tidyverse', repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/1r/8h_kl20n6qz_d2dwdpblvmm00000gt/T//RtmpLe0VNs/downloaded_packages
install.packages('ggplot2', repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/1r/8h_kl20n6qz_d2dwdpblvmm00000gt/T//RtmpLe0VNs/downloaded_packages
install.packages('ggthemes', repos ="http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/1r/8h_kl20n6qz_d2dwdpblvmm00000gt/T//RtmpLe0VNs/downloaded_packages
install.packages('plyr', repos ="http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/1r/8h_kl20n6qz_d2dwdpblvmm00000gt/T//RtmpLe0VNs/downloaded_packages
library(tidyverse) #helps wrangle data
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2) #helps visualize data
library(ggthemes)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(dplyr)
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggrepel)
getwd() #displays my working directory
## [1] "/Users/jn/Assignment"
setwd("//Users/jn/Assignment")
2.3. Importing data set
sep_order_list <- read.csv("HA-Order-List-21 Sep.csv")
3.1 Viewing the datasets
Take a look at the sep_order_list data.
head(sep_order_list)
Identify all the column in the sep_order_list data.
colnames(sep_order_list)
## [1] "order_id" "customer_fullname" "customer_email"
## [4] "order_status" "no_of_items" "Order_total"
## [7] "Billing" "Shipping" "Shipping_Method"
## [10] "Tracking_id" "Time_of_order_gmt" "line_items"
## [13] "total_discount" "date_paid" "date_paid_gmt"
3.2. Clean and format
Now that we got to know more about our data structures we will process them to look for any errors and inconsistencies. Veryfying number of customers:
n_unique(sep_order_list$customer_email)
## [1] 6421
We will now look for any duplicates:
sum(duplicated(sep_order_list))
## [1] 0
We want to ensure that column names are using right syntax and same format. We are changing the format of all columns to lower case.
sep_order_list_v2 <- clean_names(sep_order_list)
rename_with(sep_order_list,tolower)
We change format of Time_of_order_gmt to date and time format
library(lubridate)
sep_order_list_v2 %>%
mutate(time_of_order_gmt = lubridate::as_datetime(time_of_order_gmt))
Create new column Province by extract Province from Billing
library(tidyverse)
territory_pattern = "Alberta|British Columbia|Manitoba|New Brunswick|Newfoundland and Labrador| Northwest Territories|Nova Scotia|Nunavut|Ontario|Prince Edward Island|Quebec|Saskatchewan|Yukon"
sep_order_list_v3 <- sep_order_list_v2 %>%
mutate(province = str_extract(billing, territory_pattern))
Calculate total orders based on order status
table(sep_order_list_v3$order_status)
##
## cancelled completed on-hold processing
## 234 8076 76 2
In September, we have total 8076 completed orders. We only count completed orders to analyze sale revenue. So, we need to filter completed orders by order status.
sep_order_list_v4 <-sep_order_list_v3 %>% filter(order_status == "completed")
Convert “Order_total” to numeric so we can run calculations on the data
sep_order_list_v4$order_total <- as.numeric(as.character(sep_order_list_v4$order_total))
is.numeric(sep_order_list_v4$order_total)
## [1] TRUE
summary(sep_order_list_v4$order_total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 150.1 168.0 203.4 236.4 3678.4
Add columns that list the date, day, and day of week. This will allow us to calculate order for each day, or week.
sep_order_list_v4$date <- as.Date(sep_order_list_v4$time_of_order_gmt)
sep_order_list_v4$day <- format(as.Date(sep_order_list_v4$time_of_order_gmt), "%d")
sep_order_list_v4$day_of_week <- format(as.Date(sep_order_list_v4$time_of_order_gmt), "%A")
sep_order_list_v4$week <- strftime(sep_order_list_v4$date, format = "%V")
str(sep_order_list_v4)
## 'data.frame': 8076 obs. of 20 variables:
## $ order_id : int 1768862 1768863 1768864 1768865 1768866 1768868 1768869 1768870 1768871 1768872 ...
## $ customer_fullname: chr "Bradley Morin" "Sarra-Lynn Peters" "MarciaRose Atkinson" "Kyle Malaka" ...
## $ customer_email : chr "bradleymorin@gmail.com" "slp913@gmail.com" "cisaww@gmail.com" "Chibbi84@hotmail.com" ...
## $ order_status : chr "completed" "completed" "completed" "completed" ...
## $ no_of_items : int 4 4 10 2 3 3 11 17 6 7 ...
## $ order_total : num 270 196 210 185 221 ...
## $ billing : chr "337 Canyon Close - Canmore - Alberta - T1W 1H4 - Canada" "1916 89st NW - edmonton - Alberta - T6K 2A3 - Canada" "851 Thirteenth Street - New Westminster - British Columbia - V3M 5Z7 - Canada" "156 bella vista trail - Alliston - Ontario - L9R 2G8 - Canada" ...
## $ shipping : chr "337 Canyon Close - Canmore - Alberta - T1W 1H4 - Canada" "1916 89st NW - edmonton - Alberta - T6K 2A3 - Canada" "851 Thirteenth Street - New Westminster - British Columbia - V3M 5Z7 - Canada" "156 bella vista trail - Alliston - Ontario - L9R 2G8 - Canada" ...
## $ shipping_method : chr "Free shipping" "Free shipping" "Free shipping" "Free shipping" ...
## $ tracking_id : chr "9.73E+15" "9.73E+15" "9.73E+15" "9.73E+15" ...
## $ time_of_order_gmt: chr "2021-09-01T07:06:54" "2021-09-01T07:12:10" "2021-09-01T07:56:55" "2021-09-01T08:37:17" ...
## $ line_items : chr "2 Ounces for $300^ quantity 1^ SKU 2002-001^ PID 58323; Pink Kush - 28g^ quantity 1^ SKU F1-PIK-28^ PID 5264^ V"| __truncated__ "THC Capsules Super 100mg (Array)^ quantity 1^ SKU ARY-CAP-SPE-C0-T^ PID 336827; THC Distillate Vaporizer Cartri"| __truncated__ "Free Indica Pre-Rolled Joint^ quantity 1^ SKU GFT-PRE-IPJ-00-T^ PID 63009; Hooti Extracts Distillate Pen Cartri"| __truncated__ "Peanut Butter Breath - 28 Grams^ quantity 1^ SKU F1-PEB-28^ PID 1754670^ VID 1754675; Free Indica Pre-Rolled Jo"| __truncated__ ...
## $ total_discount : num 30 3.85 5 24.3 0 ...
## $ date_paid : chr "2021-09-01T00:29:03" "2021-09-01T00:29:54" "2021-09-01T10:36:26" "2021-09-04T13:13:08" ...
## $ date_paid_gmt : chr "2021-09-01T07:29:03" "2021-09-01T07:29:54" "2021-09-01T17:36:26" "2021-09-04T20:13:08" ...
## $ province : chr "Alberta" "Alberta" "British Columbia" "Ontario" ...
## $ date : Date, format: "2021-09-01" "2021-09-01" ...
## $ day : chr "01" "01" "01" "01" ...
## $ day_of_week : chr "Wednesday" "Wednesday" "Wednesday" "Wednesday" ...
## $ week : chr "35" "35" "35" "35" ...
Check missing date (if any) in date_of_order_gmt column:
FullSeq <- seq.Date(from = min(sep_order_list_v4$date), to = max(sep_order_list_v4$date), by = 1)
Missing <- FullSeq[!FullSeq %in% sep_order_list_v4$date]
This dataset has full date of 2021- Sep and 2021-Oct-01 ### STEP 4: Analyze
4.1. Total Sale Revenue
4.1.1. Daily Total Sale Revenue
order_by_day<- sep_order_list_v4 %>% group_by(date) %>%
drop_na(order_total) %>%
summarise_at(vars(starts_with('order_total')), sum)
order_by_day %>%
ggplot(aes(x = date, y = order_total)) +
geom_col(position = "dodge") +
labs(title = "Total Daily Sale Revenue")+
xlab("Date")+ylab("Total Revenue")+
theme(plot.background = element_rect(fill = "lightblue")) +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=45, hjust=1))
4.1.2. Weekly Total Sale Revenue
weekly_order <- sep_order_list_v4 %>% group_by(week) %>%
drop_na(order_total) %>%
summarise_at(vars(starts_with('order_total')), sum)
weekly_order %>%
ggplot(aes(x = week, y = order_total, fill= week)) +
geom_col(position = "dodge") +
labs(title = "Total Weekly Sale Revenue")+
xlab("Week")+ylab("Total Sale Revenue")+
theme(plot.background = element_rect(fill = "lightblue")) +
theme(text = element_text(size=10))
4.1.3. Total Sale Revenue by day of week
weekday_order <- sep_order_list_v4 %>% group_by(day_of_week) %>%
drop_na(order_total) %>%
summarise_at(vars(starts_with('order_total')), sum)
Visualize Total Sale Revenue by day of week
weekday_order %>%
ggplot(aes(x = day_of_week, y = order_total, fill = day_of_week)) +
geom_col(position = "dodge") +
labs(title = "Total Sale Revenue by Day of Week")+
xlab("Day")+ylab("Total Sale Revenue")
4.1.4. Total Sale Revenue by Province
order_by_province <- sep_order_list_v4 %>%
group_by(province) %>%
drop_na(province) %>%
summarise_at(vars(starts_with('order_total')), sum)
order_by_province %>%
ggplot(aes(x = province, y = order_total, fill=province)) +
geom_col(position = "dodge") +
labs(title = "Total Sale Revenue by Province")+
xlab("Province")+ylab("Total Sale Revenue")+
theme(plot.background = element_rect(fill = "lightblue")) +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=45, hjust=1))
4.2. Average order size
4.2.1. Average order size by day
avg_daily_order <- sep_order_list_v4 %>%
group_by(date) %>%
arrange(date) %>%
dplyr::summarise(avg_order = mean(order_total))
avg_daily_order %>%
ggplot(aes(x = date, y = avg_order)) +
geom_col(position = "dodge") +
labs(title = "Average daily order size")+
xlab("Date")+ylab("Average Order Size")+
theme(plot.background = element_rect(fill = "lightblue"))
4.2.2. Average order size by week
avg_weekly_order <- sep_order_list_v4 %>%
group_by(week) %>%
arrange(week) %>%
dplyr::summarise(avg_order = mean(order_total))
avg_weekly_order %>%
ggplot(aes(x = week, y = avg_order, fill = week)) +
geom_col(position = "dodge") +
labs(title = "Average weekly order size")+
xlab("Week")+ylab("Average Order Size")+
theme(plot.background = element_rect(fill = "lightblue"))
4.2.3. Average order size by Province
avg_order_by_province <- sep_order_list_v4 %>%
group_by(province) %>%
drop_na(province) %>%
dplyr:: summarise(avg_order = mean(order_total))
avg_order_by_province %>%
ggplot(aes(x = province, y = avg_order, fill = province)) +
geom_col(position = "dodge") +
labs(title = "Average order by Province")+
xlab("Province")+ylab("Average Order Size")+
theme(plot.background = element_rect(fill = "lightblue")) +
theme(text = element_text(size=8),
axis.text.x = element_text(angle=45, hjust=1))
4.3. Total discount
4.3.1. Total discount by day
discount_by_day <- sep_order_list_v4 %>%
group_by(date) %>%
summarise_at(vars(starts_with('total_discount')), sum)
discount_by_day %>%
ggplot(aes(x = date, y = total_discount)) +
geom_col(position = "dodge") +
labs(title = "Total Discount by day")+
xlab("Day")+ylab("Total Discount")+
theme(plot.background = element_rect(fill = "lightblue")) +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=45, hjust=1))
4.3.2. Total discount by week
discount_by_week <- sep_order_list_v4 %>%
group_by(week) %>%
summarise_at(vars(starts_with('total_discount')), sum)
discount_by_week %>%
ggplot(aes(x = week, y = total_discount, fill=week)) +
geom_col(position = "dodge") +
labs(title = "Total Discount by week")+
xlab("Week")+ylab("Total Discount")+
theme(plot.background = element_rect(fill = "lightblue"))
4.3.3. Total Discount by Day_of_Week
discount_by_dayofweek <- sep_order_list_v4 %>%
group_by(day_of_week) %>%
summarise_at(vars(starts_with('total_discount')), sum)
discount_by_dayofweek %>%
ggplot(aes(x = day_of_week, y = total_discount, fill= day_of_week)) +
geom_col(position = "dodge") +
labs(title = "Total Discount by Day_of_Week")+
xlab("Day")+ylab("Total Discount")+
theme(plot.background = element_rect(fill = "lightblue")) +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=45, hjust=1))
4.3.4. Total discount by Province
discount_by_province <- sep_order_list_v4 %>%
group_by(province) %>%
drop_na(province) %>%
summarise_at(vars(starts_with('total_discount')), sum)
discount_by_province %>%
ggplot(aes(x = province, y = total_discount, fill= province)) +
geom_col(position = "dodge") +
labs(title = "Total Discount by Province")+
xlab("Province")+ylab("Total Discount")+
theme(plot.background = element_rect(fill = "lightblue")) +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=45, hjust=1))
4.4. Most purchased items
We need to separate item from line_items first
df <- sep_order_list_v4 %>%
janitor::clean_names() %>%
select(date,week,order_id, province, line_items)
df_line_split <- df %>%
dplyr:: mutate(line_items = str_split(line_items, "\\; ")) %>%
unnest(line_items) %>%
dplyr:: mutate(order_id_plus_row_num = paste0(order_id, "-", row_number()))
df_line_item_split <- df_line_split %>%
mutate(line_items = str_split(line_items, "\\^ ")) %>%
unnest(line_items)
df_line_item_split_named <- df_line_item_split %>%
mutate(line_items = ifelse(
!str_detect(line_items, "^quantity|^SKU|^PID|^VID"), paste("product", line_items), line_items
))
df_line_item_split_named_separated <- df_line_item_split_named %>%
extract(col="line_items", into = c("name","value"), regex="^(\\S+)\\s+(.*)") %>%
filter(!is.na(name))
final_item <- df_line_item_split_named_separated %>%
pivot_wider(names_from = name, values_from = value)
Change quantity column format to numeric:
final_item$quantity<- as.numeric(as.character(final_item$quantity))
is.numeric(final_item$quantity)
## [1] TRUE
Clean ; in cells of PID column
final_item$PID <-gsub(";","",as.character(final_item$PID))
Find most purchased_items:
most_purchased_items <- final_item %>%
filter(!is.na(PID)) %>%
dplyr::count(PID, product, sort = TRUE)
most_purchased_items %>%
slice_max(n, n=50)
Less popular items:
most_purchased_items %>%
slice_min(n, n=1)
4.4.1. Find most purchased item by date:
most_purchased_item_by_date <- final_item %>%
group_by(date) %>%
filter(!is.na(PID)) %>%
dplyr::count(PID,product) %>%
slice_max(n, n=1)
most_purchased_item_by_date %>%
ggplot(mapping = aes(x = date, y = n, color = product) ) + geom_point()+
labs(title = "Most purchased item by date")+
xlab("Date")+ylab("Total of items")+
theme(plot.background = element_rect(fill = "lightblue")) +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=90, hjust=1))
4.4.2. Most purchased item by week
most_purchased_item_by_week <- final_item %>%
group_by(week) %>% filter(!is.na(PID)) %>%
dplyr::count(PID,product) %>%
slice_max(n, n=3)
most_purchased_item_by_week %>%
ggplot(mapping = aes(x = week, y = n, color = product) ) + geom_point()+
labs(title = "Most purchased item by week")+
xlab("Week")+ylab("Total of items")+
theme(plot.background = element_rect(fill = "lightblue")) +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=90, hjust=1))
4.4.3. Most purchased items by Province
most_purchased_item_by_province <- final_item %>%
drop_na(province) %>%
group_by(province) %>%
filter(!is.na(PID)) %>%
dplyr::count(PID,product) %>%
slice_max(n, n=3)
most_purchased_item_by_province %>%
ggplot(mapping = aes(x = province, y = n, color = product) ) + geom_point()+
labs(title = "Most purchased item by Province")+
xlab("Province")+ylab("Total of items")+
theme(plot.background = element_rect(fill = "lightblue")) +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=45, hjust=1))
5.1. Total Order vs Total Discount:
sep_order_list_v4 %>%
select(order_total,
total_discount) %>%
summary()
## order_total total_discount
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 150.1 1st Qu.: 0.00
## Median : 168.0 Median : 10.45
## Mean : 203.4 Mean : 17.75
## 3rd Qu.: 236.4 3rd Qu.: 25.00
## Max. :3678.4 Max. :791.12
ggplot(data=sep_order_list_v4, aes(x=total_discount, y=order_total)) + geom_point() +
labs(title = "Total Order vs Total Discount")+
xlab("Total Discount")+ylab("Total Order")+
geom_smooth(se = FALSE, method = lm) +
theme(plot.background = element_rect(fill = "lightblue"))
## `geom_smooth()` using formula 'y ~ x'
4.5. Most purchased customers
sep_order_list_v4 %>%
group_by(customer_email) %>%
summarise_at(vars(starts_with('order_total')), sum) %>%
slice_max(order_total, n= 10)
### STEP 5: Share
There are some interesting insights I found in this data: Sales revenues, total discounts used was recorded in only September and can be biased since the trend and pattern can change by month.
1) It seems that day_of_week have a positive impact for the business. The busiest days of the week for sales on HA shop are Thursday and Friday, lowest days for sale are Saturday and Sunday.
2) Geographic influences: Ontario was the biggest contributor to sale revenues in September, the second contributor is British Columbia. Northwest Territories and Yukon were recorded the lowest sale revenues.
3) Discounts affect to sales revenues: from the Total Order vs Total Discount graph we clearly see that if discounts help to increase sale revenue. Because the discounts attract more people, you have more potential buyers for other items in your store, as most people will look around to see what you offer before making a purchase.
4) According to most purchased item by date graph, we see Deal of The Day(PID388713) is most purchased item, it is also clearly that discount/promation help to boost sales revenue.
Top 5 best-selling items in September: 1. PID388713 Deal Of The Day
2. PID63008 Free Sativa Pre-Rolled Joint
3. PID63009 Free Indica Pre-Rolled Joint
4. PID983545 200mg THC Assorted Gummies (Sugar Jack’s)
5. PID1746535 Free THC Snack Pack
* Top 5 daily best-selling items: 1. PID388713 Deal Of The Day 2. PID983545 200mg THC Assorted Gummies (Sugar Jack’s)
3. PID63008 Free Sativa Pre-Rolled Joint 4. PID4505 Death Bubba - 28g 5. PID63009 Free Indica Pre-Rolled Joint
Recommendations for HA:
* First-time shopping offers: Offering introductory discounts, or have specials such as buy 2-get-1-for half-price or free gift for the first three purchases. Especially, we should have exclusive promotion programs in specific province such as: Yukon and Northwest Territories to attract new customers and increase sale in these province.
* Loyalty program: We can offer exclusive discounts or promotions to customer base — they’ll feel both excited and valued. Example, we can have monthly reward (platinum reward, gold reward, sliver reward) for customers who spent more than $C2,000, $C1500, $C500…
* Get rid of old inventory.: We can sell popular and less popular products together to get rid of old inventory. Try creating kits that include less popular but related products to sell more of inventory.
* Volume discounts: The more you buy, the more you save! Offering discounts for wholesale buying helps store gets rid of some inventory, sells more items, and keep customers happy.
* Weekend special offer campaigns: We see that total order volume is lowest on Saturday and Sunday. Offer a weekend-only to boost more sales in the weekend. Customers only get those offer on Saturday, or Sunday.